In this paper, we are interested in learning a generalizable person re-identification (re-ID) representation from unlabeled videos. Compared with 1) the popular unsupervised re-ID setting, where the training and test sets are typically under the same domain, and 2) the popular domain generalization (DG) re-ID setting, where the training samples are labeled, our novel scenario combines their key challenges: the training samples are unlabeled and collected from various domains that do not align with the test domain. In other words, we aim to learn a representation in an unsupervised manner and directly use it for re-ID in novel domains. To fulfill this goal, we make two main contributions. First, we propose Cycle Association (CycAs), a scalable self-supervised learning method for re-ID with low training complexity. Second, we construct a large-scale unlabeled re-ID dataset named LMP-video, tailored for the proposed method. Specifically, CycAs learns re-ID features by enforcing cycle consistency of instance association between temporally successive video frame pairs, and the training cost is merely linear in the data size, making large-scale training possible. Meanwhile, the LMP-video dataset is extremely large, containing 50 million unlabeled person images cropped from over 10K YouTube videos, and is therefore sufficient to serve as fertile soil for self-supervised learning. Trained on LMP-video, CycAs is shown to generalize well to novel domains, sometimes even outperforming supervised domain generalizable models. Remarkably, CycAs achieves 82.2% Rank-1 on Market-1501 and 49.0% Rank-1 on MSMT17 with zero human annotation, surpassing state-of-the-art supervised DG re-ID methods. Moreover, we also demonstrate the superiority of CycAs under the canonical unsupervised re-ID and pretrain-and-finetune scenarios.
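To make the cycle-consistency idea concrete, here is a minimal PyTorch sketch, not the authors' implementation: the function name, the temperature, and the exact loss form are assumptions. Associating each instance softly to the next frame and back should return it to itself, which yields a self-supervised classification target.

```python
import torch
import torch.nn.functional as F

def cycle_association_loss(x, y, tau=0.1):
    """Hypothetical cycle-consistency loss between two frames.

    x: (N, d) L2-normalized instance embeddings from frame t
    y: (M, d) L2-normalized instance embeddings from frame t+1
    Assumes the N instances in frame t also appear in frame t+1.
    """
    sim = x @ y.t() / tau              # pairwise similarities
    fwd = F.softmax(sim, dim=1)        # soft assignment t -> t+1
    bwd = F.softmax(sim.t(), dim=1)    # soft assignment t+1 -> t
    cycle = fwd @ bwd                  # round trip t -> t+1 -> t; rows sum to 1
    target = torch.arange(x.size(0), device=x.device)
    # after the cycle, each instance should land back on itself
    return F.nll_loss(torch.log(cycle + 1e-8), target)
```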
We propose a task that we name portrait interpretation and construct a dataset named Portrait250k for it. Current research on portraits, such as human attribute recognition and person re-identification, has achieved many successes, but generally it: 1) may lack the mining of the interrelationships between various tasks and the possible benefits they may bring; 2) designs deep models specifically for each task, which is inefficient; and 3) may be unable to meet the need for a unified model and comprehensive perception in practical scenarios. In this paper, the proposed portrait interpretation recognizes the perception of humans from a new, systematic perspective. We divide the perception of portraits into three aspects, namely appearance, posture, and emotion, and design corresponding subtasks. Based on a multi-task learning framework, portrait interpretation requires a comprehensive description of both the static attributes and the dynamic states of portraits. To invigorate research on this new task, we construct a new dataset containing 250,000 images labeled with identity, gender, age, physique, height, expression, and the posture of the whole body and arms. Our dataset is collected from 51 movies and hence covers extensive diversity. Furthermore, we focus on representation learning for portrait interpretation and propose a baseline that reflects our systematic perspective. We also propose appropriate metrics for this task. Our experimental results demonstrate that combining the tasks related to portrait interpretation can yield benefits. Code and dataset will be made public.
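A minimal sketch of what such a multi-task baseline could look like; the class name, head layout, and label-space sizes are assumptions for illustration, not the paper's architecture. One shared backbone produces a feature that three aspect-specific heads consume.

```python
import torch.nn as nn
from torchvision.models import resnet18

class PortraitMultiTaskNet(nn.Module):
    """Hypothetical baseline: one shared backbone, one head per aspect."""
    def __init__(self, n_identities=1000, n_expressions=7, n_poses=10):
        super().__init__()
        backbone = resnet18(weights=None)          # no pretrained weights here
        self.backbone = nn.Sequential(*list(backbone.children())[:-1])
        feat_dim = 512
        # appearance / emotion / posture heads; label spaces are assumptions
        self.appearance = nn.Linear(feat_dim, n_identities)
        self.emotion = nn.Linear(feat_dim, n_expressions)
        self.posture = nn.Linear(feat_dim, n_poses)

    def forward(self, x):
        f = self.backbone(x).flatten(1)            # shared (B, 512) feature
        return self.appearance(f), self.emotion(f), self.posture(f)
```

The losses of the three heads would then be summed (possibly with task weights), which is where the cross-task benefit reported in the abstract would emerge.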
6-DoF grasp pose detection for multiple grasps and multiple objects is a challenging task in the field of intelligent robotics. To imitate the human reasoning ability for grasping objects, data-driven methods have been widely studied. With the introduction of large-scale datasets, we find that a single physical metric usually produces only a few discrete levels of grasp confidence scores, which cannot finely distinguish millions of grasp poses and leads to inaccurate prediction results. In this paper, we propose a hybrid physical metric to address this evaluation insufficiency. First, we define a new metric based on the force-closure metric, supplemented by measurements of object flatness, gravity, and collision. Second, we leverage this hybrid physical metric to generate refined confidence scores. Third, to learn the new confidence scores effectively, we design a multi-resolution network called Flatness Gravity Collision GraspNet (FGC-GraspNet). FGC-GraspNet proposes a multi-resolution feature learning architecture for multiple tasks and introduces a new joint loss function that enhances the average precision of grasp detection. Network evaluation and extensive real robot experiments demonstrate the effectiveness of our hybrid physical metric and FGC-GraspNet. Our method achieves a 90.5% success rate in real-world cluttered scenes. Our code is available at https://github.com/luyh20/fgc-graspnet.
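As a rough illustration of how several physical measurements could be fused into a single fine-grained confidence score: the function name, the linear-combination form, and the weights below are all assumptions, not the paper's exact formulation.

```python
import numpy as np

def hybrid_grasp_score(force_closure, flatness, gravity, collision_free,
                       weights=(0.4, 0.2, 0.2, 0.2)):
    """Hypothetical fusion of four physical measurements (each assumed
    normalized to [0, 1]) into one continuous confidence score, so that
    grasp poses are no longer bucketed into a few discrete levels."""
    parts = np.array([force_closure, flatness, gravity, collision_free])
    return float(np.dot(weights, parts))
```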
Clustering-based unsupervised domain adaptive (UDA) person re-identification (ReID) reduces exhaustive annotation. However, owing to unsatisfactory feature embeddings and imperfect clustering, pseudo labels for target domain data inherently contain an unknown proportion of wrong ones, which misleads feature learning. In this paper, we propose an approach named probabilistic uncertainty guided progressive label refinery (P$^2$LR) for domain adaptive person re-identification. First, we propose to model labeling uncertainty with the probabilistic distance to the ideal single-peak distribution. A quantitative criterion is established to measure the uncertainty of pseudo labels and facilitate network training. Second, we explore a progressive strategy for refining pseudo labels. With uncertainty-guided alternative optimization, we balance the exploration of target domain data against the negative effects of noisy labels. On top of a strong baseline, we obtain significant improvements and achieve state-of-the-art performance on four UDA ReID benchmarks. Specifically, our method outperforms the baseline by 6.5% mAP on the Duke2Market task, while surpassing the state-of-the-art method by 2.5% mAP on the Market2MSMT task.
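A minimal sketch of the uncertainty-filtering idea, assuming the probabilistic distance is measured against a one-hot ideal distribution; the function names and the keep-ratio selection rule are illustrative assumptions, not the paper's exact criterion.

```python
import torch
import torch.nn.functional as F

def pseudo_label_uncertainty(probs, pseudo_labels):
    """Uncertainty as KL divergence between the ideal one-peak (one-hot)
    distribution and the predicted distribution.
    probs: (N, C) softmax outputs; pseudo_labels: (N,) cluster ids.
    KL(one_hot || probs) reduces to -log p(pseudo label)."""
    p_y = probs.gather(1, pseudo_labels[:, None]).squeeze(1)
    return -torch.log(p_y + 1e-8)

def select_reliable(probs, pseudo_labels, keep_ratio=0.8):
    """Keep only the most certain fraction of pseudo-labeled samples."""
    u = pseudo_label_uncertainty(probs, pseudo_labels)
    k = int(keep_ratio * len(u))
    return torch.topk(-u, k).indices   # indices of low-uncertainty samples
```

In a progressive scheme, `keep_ratio` would grow over training rounds as the embeddings and clusters improve, trading label purity against data coverage.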
Data association in multi-target multi-camera tracking (MTMCT) usually estimates affinity directly from re-identification (re-ID) feature distances. However, we argue that this may not be the best choice, owing to the mismatch in matching scope between the re-ID and MTMCT problems. Re-ID systems focus on global matching, retrieving targets from all cameras and all times. In contrast, data association in tracking is a local matching problem, since its candidates only come from neighboring locations and time frames. In this paper, we design experiments to verify this misfit between global re-ID feature distances and local matching in tracking, and propose a simple yet effective approach to adapt the affinity estimation to the corresponding matching scope in MTMCT. Instead of trying to handle all appearance changes, we tailor the association metric to specialize in the ones that may emerge during data association. To this end, we introduce a new data sampling scheme with a temporal window for data association in tracking. By minimizing the mismatch, the adaptive affinity module brings significant improvements over the global re-ID distance and produces competitive performance on the CityFlow and DukeMTMC datasets.
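A small sketch of the temporal-window sampling idea, under the assumption that training detections carry identity labels (as in DukeMTMC-style data); the function name and tuple layout are illustrative.

```python
def sample_local_pairs(detections, window=50):
    """Hypothetical sampler: only pair detections whose frame gap lies
    within `window`, mirroring the local scope of tracking association.
    detections: list of (identity, frame, feature) tuples."""
    pairs = []
    for i, (id_a, frame_a, feat_a) in enumerate(detections):
        for id_b, frame_b, feat_b in detections[i + 1:]:
            if abs(frame_a - frame_b) <= window:
                # positive pair if same identity, else negative
                pairs.append((feat_a, feat_b, int(id_a == id_b)))
    return pairs
```

A metric trained on such pairs never sees the long-range appearance variation that global re-ID must absorb, which is precisely the specialization the abstract argues for.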
Semi-supervised learning aims to leverage a large amount of unlabeled data to boost performance. Existing work primarily focuses on image classification. In this paper, we delve into semi-supervised learning for object detection, where labeled data are far more labor-intensive to collect. Current methods are easily distracted by the noisy regions generated by pseudo labels. To combat noisy labels, we propose noise-resistant semi-supervised learning by quantifying region uncertainty. We first investigate the adverse effects brought by different forms of noise associated with pseudo labels. Then we propose to quantify the uncertainty of regions by identifying their noise-resistant properties under different strengths. By incorporating this region uncertainty quantification and promoting multi-peak probability distribution outputs, we introduce uncertainty into training and thereby achieve noise-resistant learning. Experiments on both PASCAL VOC and MS COCO demonstrate the distinguished performance of our method.
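One way the "multi-peak probability distribution outputs" could enter training is through a soft-target loss on region classifiers; this is a sketch of that general mechanism, not the paper's loss, and the function name is an assumption.

```python
import torch.nn.functional as F

def soft_region_loss(logits, teacher_probs):
    """Hypothetical soft cross-entropy for region classification: train
    the student against the teacher's full (possibly multi-peak)
    distribution rather than a hard pseudo label, so ambiguous regions
    contribute softer, less misleading gradients.
    logits: (N, C) student outputs; teacher_probs: (N, C) soft targets."""
    log_p = F.log_softmax(logits, dim=1)
    return -(teacher_probs * log_p).sum(dim=1).mean()
```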
Tracking objects of interest in a video is one of the most popular and widely applied problems in computer vision. However, over the years, a Cambrian explosion of use cases and benchmarks has fragmented the problem into a multitude of different experimental setups. As a consequence, the literature has fragmented too, and novel approaches proposed by the community are now usually specialized to fit only one specific setup. To understand to what extent this specialization is necessary, in this work we present UniTrack, a solution to address five different tasks within the same framework. UniTrack consists of a single, task-agnostic appearance model, which can be learned in a supervised or self-supervised fashion, and multiple "heads" that address individual tasks and do not require training. We show how most tracking tasks can be solved within this framework, and that the same appearance model can be used to obtain results competitive with specialized methods on most of the tasks considered. The framework also allows us to analyze appearance models obtained with state-of-the-art self-supervised methods, thus extending their evaluation and comparison to a larger variety of important problems.
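A minimal sketch of what a training-free "head" on top of a frozen appearance model could look like: propagation of labels by nearest-neighbor feature matching. The function name is an assumption; the actual heads in UniTrack vary per task.

```python
import torch
import torch.nn.functional as F

def propagate_by_appearance(query_feats, ref_feats, ref_labels):
    """Hypothetical training-free head: associate each query to the most
    similar reference under cosine similarity of frozen appearance
    features, and inherit its label.
    query_feats: (Q, d), ref_feats: (R, d), ref_labels: (R,)."""
    q = F.normalize(query_feats, dim=1)
    r = F.normalize(ref_feats, dim=1)
    nn_idx = (q @ r.t()).argmax(dim=1)   # nearest reference per query
    return ref_labels[nn_idx]
```

Because the head has no parameters, swapping in a different appearance model (e.g., a self-supervised one) changes nothing else, which is what makes the framework a clean testbed for comparing representations.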
Employing part-level features for pedestrian image description offers fine-grained information and has been verified as beneficial for person retrieval in very recent literature. A prerequisite of part discovery is that each part should be well located. Instead of using external cues, e.g., pose estimation, to directly locate parts, this paper lays emphasis on the content consistency within each part. Specifically, we target at learning discriminative part-informed features for person retrieval and make two contributions. (i) A network named Part-based Convolutional Baseline (PCB). Given an image input, it outputs a convolutional descriptor consisting of several part-level features. With a uniform partition strategy, PCB achieves competitive results with the state-of-the-art methods, proving itself as a strong convolutional baseline for person retrieval. (ii) A refined part pooling (RPP) method. Uniform partition inevitably incurs outliers in each part, which are in fact more similar to other parts. RPP re-assigns these outliers to the parts they are closest to, resulting in refined parts with enhanced within-part consistency. Experiment confirms that RPP allows PCB to gain another round of performance boost. For instance, on the Market-1501 dataset, we achieve (77.4+4.2)% mAP and (92.3+1.5)% rank-1 accuracy, surpassing the state of the art by a large margin.
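The uniform-partition step of PCB translates naturally into code; the sketch below captures that idea (split the backbone's final feature map into p horizontal stripes, pool each, classify each part independently), though the class name and default values are illustrative rather than the authors' release.

```python
import torch.nn as nn

class PCBHead(nn.Module):
    """Sketch of PCB-style uniform partition: p horizontal stripes of
    the final feature map, each pooled into a part-level feature with
    its own identity classifier."""
    def __init__(self, in_channels=2048, num_parts=6, num_ids=751):
        super().__init__()
        # pooling to (p, 1) splits the height axis into p stripes
        self.pool = nn.AdaptiveAvgPool2d((num_parts, 1))
        self.classifiers = nn.ModuleList(
            nn.Linear(in_channels, num_ids) for _ in range(num_parts))

    def forward(self, feat_map):                   # (B, C, H, W)
        parts = self.pool(feat_map).squeeze(-1)    # (B, C, p)
        return [clf(parts[:, :, i])                # one logit set per part
                for i, clf in enumerate(self.classifiers)]
```

RPP would then replace the hard stripe boundaries with a learned soft assignment of each spatial location to its most similar part.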
In this paper, we propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes the image and point cloud tokens as inputs and directly outputs accurate 3D bounding boxes. The spatial alignment of multi-modal tokens is performed implicitly, by encoding the 3D points into multi-modal features. The core design of CMT is quite simple, while its performance is impressive. CMT obtains 73.0% NDS on the nuScenes benchmark. Moreover, CMT exhibits strong robustness even if the LiDAR is missing. Code will be released at https://github.com/junjie18/CMT.
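A sketch of the token-level fusion idea, assuming the position-aware encodings have already been added to the tokens; the class name, dimensions, and query count are assumptions, not CMT's actual configuration. Object queries attend jointly to both modalities in one decoder.

```python
import torch
import torch.nn as nn

class CrossModalDecoderSketch(nn.Module):
    """Hypothetical fusion decoder: concatenate image and point-cloud
    tokens and let a shared set of object queries attend to both."""
    def __init__(self, d_model=256, n_heads=8, n_layers=6, n_queries=900):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.queries = nn.Embedding(n_queries, d_model)

    def forward(self, img_tokens, pts_tokens):   # (B, Ni, d), (B, Np, d)
        memory = torch.cat([img_tokens, pts_tokens], dim=1)
        q = self.queries.weight.unsqueeze(0).expand(memory.size(0), -1, -1)
        return self.decoder(q, memory)  # per-query features for the box heads
```

Since the queries see a single concatenated token sequence, dropping one modality (e.g., a missing LiDAR) only shortens the memory rather than breaking the pipeline, which is consistent with the robustness claim above.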
Knowledge graphs (KG) have served as a key component of various natural language processing applications. Commonsense knowledge graphs (CKG) are a special type of KG, where entities and relations are composed of free-form text. However, previous works on KG completion and CKG completion suffer from long-tail relations and newly-added relations which do not have many known triples for training. In light of this, few-shot KG completion (FKGC), which requires the strengths of both graph representation learning and few-shot learning, has been proposed to address the problem of limited annotated data. In this paper, we comprehensively survey previous attempts at such tasks in the form of a series of methods and applications. Specifically, we first introduce FKGC challenges, commonly used KGs, and CKGs. Then we systematically categorize and summarize existing works in terms of the type of KGs and the methods. Finally, we present applications of FKGC models on prediction tasks in different areas and share our thoughts on future research directions for FKGC.